Add posibility to parse multiple line definitions #378

bosd · 2022-05-25T21:52:27Z

Supersedes #225

Fixes: #155 No warning with lines - Solved in a different manner to show why something matches and on which setting.
Fixes: #224 Support for multiple lines using lines plugin - With addition of line contentating.
Fixes: #186 Support for multiple line groups in a template
Fixes: #238 Is there a support for multiple regex for lines plugin?
Closes: #377 Parse Multiple lines in the same field

Refactored code to allow multiple lines to be parsed.
Restructuring was needed.
As we want to output the lines in the order of the invoice file.
Independent of the order of the template file.

Example yaml input

lines:
 - start: Barcode
   end:  Netto totaal
   line: (?P<line_note>(FOOD))
 - start: Barcode
   end:  Netto totaal
   line: (?P<description>(.+))\s+(?P<price>\d+\d+)

it is also possible to contentate multi line tags.
alternative example sammymaystone.

    lines:
     - start: Item\s+Quantity\s+Rate\s+Amount
       end:  Subtotal
       first_line: 'Service (?P<item>\w)\s+(?P<qty>\S+)\s+.(?P<unitprice>\d+.\d{2})\s+.?(?P<linetotal>\d+.\d{2})'
       line: '(?P<item>.*)'
       skip_line: ['Pino \w'] # 'Description:', Notes:,
       last_line: '(?P<desc>Parts:.*)'
       types:
         qty: int
         unitprice: float
         linetotal: float

input:

Service A                                                                 12                      0.00         20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%

output:
{'item': 'A\nDescription: Repair\nNotes: Replaced capacitor', 'qty': 12, 'unitprice': 10.0, 'linetotal': 120.0, 'desc': 'Parts: 1 x cap_a'}

bosd refactored code to allow multiple lines to be parsed. Restructuring was needed. As we want to output the lines in the order of the invoice file. Independent of the order of the template file. Example yaml input lines: - start: Barcode end: Netto totaal line: (?P<line_note>(FOOD)) - start: Barcode end: Netto totaal line: (?P<description>(.+))\s+(?P<price>\d+\d+) it is also possible to contentate multi line tags. alternative example sammymaystone. lines: - start: Item\s+Quantity\s+Rate\s+Amount end: Subtotal first_line: 'Service (?P<item>\w)\s+(?P<qty>\S+)\s+.(?P<unitprice>\d+.\d{2})\s+.?(?P<linetotal>\d+.\d{2})' line: '(?P<item>.*)' skip_line: ['Pino \w'] # 'Description:', Notes:, last_line: '(?P<desc>Parts:.*)' types: qty: int unitprice: float linetotal: float input: Service A 12 0.00 20.00 Description: Repair Notes: Replaced capacitor Parts: 1 x cap_a Tax: 0.2% output: {'item': 'A\nDescription: Repair\nNotes: Replaced capacitor', 'qty': 12, 'unitprice': 10.0, 'linetotal': 120.0, 'desc': 'Parts: 1 x cap_a'}

m3nu · 2022-05-29T13:48:02Z

Thanks for the contribution! Will test this locally first.

m3nu · 2022-05-29T13:49:59Z

Test should pass. Linting in this case.

bosd · 2022-05-29T22:29:34Z

Flake8 was not willing to install on my local system.

So fixed the errors manually. Could'nt test, but flake8 should be happy now.

bosd · 2022-06-15T09:29:43Z

@m3nu Should be fixed now, can you trigger the tests?

rmilecki · 2022-06-19T19:12:56Z

I'm quite a bit confused by this pull request. I'm not sure what does it do and what are you trying to fix/improve.

A single commit modifying hundreds of lines is hard to follow. It'd be nicer to have few smaller commits that can be reviewed properly.
You referenced issues that may be resolved already (Support for multiple lines using lines plugin #224 actually is)
Your examples and - what's worse - updated documentation uses the old syntax, please use the parser: lines form

Please start with explaining your case.

Are you dealing with multiple invoice sections that need to be parsed with lines?
Are you dealing with a single invoice section that contains lines of multiple formats?

That example you provided ("Service A") could be parsed without this pull request so it doesn't really help to understand your changes.

bosd · 2022-06-19T22:33:46Z

I'm quite a bit confused by this pull request. I'm not sure what does it do and what are you trying to fix/improve.

This contribution allow to parse all the invoice lines against different regexes.

I'm sorry, I was unable to do it in a smaller commit.
Basically it is a continuation of 225, but then ported to your contribution of the lines refactoring.
Simply put, each invoice line is tested against a setting which is calling the parser.

Invoice2data is used as a module for the importing of invoices in Odoo.
Previously it was not possible to import multiple lines in the desired format.
This also explains the contentation function. Will try to explain better.
In the input we have

Service A                                                                 12                      0.00         20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%

Let's call that for now a section.

That section consists of multiple lines.
Lets focus on the description elements. Suppose in the ERP we need the following:
In the tag
item we need a Multiline description, by which the lines are separated by \n
Earlier it was not possible to achieve that. But with this commit we can generate in the lines tag:
'item': 'A\nDescription: Repair\nNotes: Replaced capacitor'
Thus, information parsed from multiple input lines, can be combined in a single tag.

With the alternative syntax, introduced by #308 (what you call new) it was not possible to create the desired output.
Therefore I would vote against the "new" syntax, because of breaking compatibility. (There are users with their own private templates)
And in my opinion it would be more difficult to generate the template lines from template creator.
(A template creator module is something I am working on, currently only exists in notes and in my mind)
If it was my decision, I would revert the "new" syntax. But hey, it is open source. It was quite cumbersome to also support the newly introduced alternative syntax. But decided to put in the extra hours, as maybe some people beside yourself started using it. So now users will have a choice which one to use.

In the "old" syntax all the line info was captured in the lines tag.
However, by the introduction of the "new" syntax it is possible to choose your own tag.
Like in the provided example com.sammymaystone.yaml. Instead us using

   lines:
    parser: lines
    start: 'Item\s+Quantity\s+Rate\s+Amount'
    first_line: 'Service (?P<item>\w)'
    line: '(?P<desc>.*)'
    skip_line: ['Description:', "Notes:"]
    last_line: '(?P<desc>Parts:.*)'
    end: 'Subtotal'

it reads (note the line_items ):

  line_items:
    parser: lines
    start: 'Item\s+Quantity\s+Rate\s+Amount'
    first_line: 'Service (?P<item>\w)'
    line: '(?P<desc>.*)'
    skip_line: ['Description:', "Notes:"]
    last_line: '(?P<desc>Parts:.*)'
    end: 'Subtotal'

The added flexibility may look as an improvement on first hand.
IMO, it is one step further away from a standard solution.
Now we get users using the lines tag and line_items and whatever variation users come up with.
Thus templates are not that easy to be universally transferred.
This is more an issue, for the "standard field names" discussion / PR but it is closely related/
If we want to translate the output from this module to an ERP in the field lines_erp, now to create a mapping entry for
lines to lines_erp
line_items to lines_erp
whatever_someone_makes_up to lines_erp

Sorry, getting a bit late at night typing this. I don't fully recall how the response is with your "new" syntax and contentating.
As in my use case, what you call the "old" syntax is used. In the output I need only one lines tag containing all the line items.
(see first post).
Not:

{'lines': {'all items matching the first parser call'}}

{'lines': {'all items matching the second parser call'}}

To add we want the lines output to be in the same order as they where found in the original input file.
So as the ERP re-assembles the invoice, we can do a full side by side compare and all the line items are matching. 👍

Didn't have the time to do it. But I really want to revert back the "new" syntax.
Or at least the "bad" sammy maystone example. As people may start to use it as a base for creating their own templates and only adding to the confusion.

Hope this is more clear. It's Getting to late now, will continue/improve this post later on 😪 💤

bosd · 2022-06-20T05:57:13Z

With this PR it is also possible to use multiple line groups in the configuration.
Using the "alternative" syntax that is not possible, limited by Yaml syntax constraint.
Yaml does not allow the same key to be used multiple times.

Quick example of correct syntax for multiple line configurations.

lines:
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(FOOD))
 - start: Barcode
   end:  Netto totaal
   line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)

m3nu · 2022-06-23T13:18:31Z

@m3nu Should be fixed now, can you trigger the tests?

Check running. Sorry for the delay.

bosd · 2022-06-24T07:58:34Z

Check running. Sorry for the delay.

NP!

The CI tests are failling. Gonna need some help and pointers why that's the case.
As I could not reproduce it locally.

Local tests are failling for different reasons:
Import Error on local unittests #380
FilenotfoundError #381
pdfminer not installed #379

m3nu · 2022-06-24T09:05:29Z

One is a failed test comparison. Maybe a different extraction result.

The other is a missing variable. Probably a typo.

>               content_of_setting = content[start.end() : end.start()]
E               AttributeError: 'NoneType' object has no attribute 'end'

https://github.com/invoice-x/invoice2data/runs/7024293832?check_suite_focus=true

bosd · 2022-06-24T18:44:34Z

local tests are passing, with some code changes.
(path related, will create PR)
But I get the impression that the local tests I'm running are slightly different from the CI tests.

So it's a bit of trail and error. Hope you don't mind triggering the CI a couple of times. In case the latest commit doesn't fix it.

rmilecki · 2022-06-26T21:41:27Z

@bosd: I feel you really don't like the new syntax, I'll have to argue a bit here.

I would vote against the "new" syntax, because of breaking compatibility

While I did my best not to beak backward compatibility, you claim I did. Please create a new issue providing details of YAML that stopped working for you after my changes. It's really important for me.

it would be more difficult to generate the template lines from template creator

I can't treat it as a serious argument against new syntax. It's about some unpublihsed software I can't even test. I can't imagine why it's so hard to generate a different YAML. Maybe your software needs some refactoring.

If it was my decision, I would revert the "new" syntax. But hey, it is open source.

Exactly, it's open source, please suggest a better soltuion for problems I attempted to resolve (resolved?) with my new syntax. Just reverting is not a solution.

However, by the introduction of the "new" syntax it is possible to choose your own tag.
Thus templates are not that easy to be universally transferred.

Invoice fields are not universally named. Line fields are not universally named. Yet you blame me for dropping one ~~generic~~ standarized (forced) field name. Can't you see the real problem is way bigger and we need standarized names for way more fields than just lines?

It feels annoying to have you blaming one feature (syntax) and keeping that semi off-topic discussion here. If at least you could suggest another solution. I wasn't the only person what wanted reusable lines parser, but it seems you didn't even care to read referenced issues. #186 and #224 which you claim to fix were actually fixed by introducing parsers syntax.

rmilecki · 2022-06-26T21:48:29Z

You spent quite some time criticizing the new syntax but not so much explaining your changes. Can we take a step back and discuss what do we need to support, please?

You provided a following example of input invoice:

Service A                                                                 12                      0.00         20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%

and expected parsing output:

 'item': 'A\nDescription: Repair\nNotes: Replaced capacitor'

I took a moment to create a fake invoice with following content:

Item Quantity Rate Amount
Service A                     12   0.00   20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%
Subtotal 20.00

and wrote a simple YAML with following lines entry:

  lines:
    parser: lines
    start: 'Item\s+Quantity\s+Rate\s+Amount'
    first_line: 'Service (?P<item>\w)'
    line: '(?P<item>.*)'
    skip_line: [ 'Parts:', 'Tax:' ]
    end: 'Subtotal'

parsing results in:

'lines': [{'item': 'A\nDescription: Repair\nNotes: Replaced capacitor'}],

which is exactly what you expected.

So no changes to invoice2data are needed to parse that example invoice section you provided.

Can you provide an actual example of invoice section that:

Can't be parsed with existing invoice2data code
Can be supported with your changes

bosd · 2022-06-27T06:35:57Z

While I did my best not to beak backward compatibility, you claim I did.

Sorry, Should've been more clear. It is not broken. yet..
Moi point is aimed more at a comment the "old" syntax should be phased out in a year.

Can't you see the real problem is way bigger and we need standarized names for way more fields than just lines?

Yes, I agree the problems are bigger. But can't solve them all at once :) More PR's are coming.

I wasn't the only person what wanted reusable lines parser, but it seems you didn't even care to read referenced issues. #186 and #224 which you claim to fix were actually fixed by introducing parsers syntax.

I've looked in into those issues. Yet they did not resolve it for me.
I'll have to leave it at that for now as I'm just leaving a quick reply while commuting. Will do some more tests and provide an more extensive answer later on.

Can you provide an actual example of invoice section that:

Can't be parsed with existing invoice2data code

Can be supported with your changes

Yes, Will do

Ideally I would like to add more "advanced" examples.
(and an example included in this PR.)
I've made a set of templates but they are not ready to be committed yet. This PR is just a small step in getting them processed.
There is still some more work needed like:

Standard field names
Ability to parse HTML style table invoices

It's coming, just need more dev time 😃

rmilecki · 2022-06-28T05:16:10Z

Moi point is aimed more at a comment the "old" syntax should be phased out in a year.

Good point! Given how little maintainance it takes to keep old syntax support I think we should keep it as long as it doesn't cause serious development / maintenance problems.

Yes, Will do

Ideally I would like to add more "advanced" examples.
(and an example included in this PR.)

Perfect. I think we must have a real example to understand why this change is needed and review it properly. For me it's a really bad idea to merge code that we don't know what it's supposed to solve.

bosd · 2022-06-28T10:23:17Z

Ok, going back to the original issue in #377
Parse Multiple lines in the same field.

When passing multiple line arguments in the template I won't get the desired outcome.
Made an fictitious Example of an invoice which cannot be parsed. It's inspired by a real wholesaler invoice.
Mekro.pdf

Template (old syntax):

# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
lines:
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---FOOD ITEMS---))
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---OTHER ITEMS---))
 - start: Barcode
   end:  Netto totaal
   line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','

While running it against the code before this pr:

'''
Traceback (most recent call last):
File "/home/bosd/.local/bin/invoice2data", line 10, in
sys.exit(main())
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/main.py", line 212, in main
res = extract_data(f.name, templates=templates, input_module=input_module)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/main.py", line 95, in extract_data
return t.extract(optimized_str)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/extract/invoice_template.py", line 218, in extract
plugin_func.extract(self, optimized_str, output)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/extract/plugins/lines.py", line 13, in extract
lines = parsers.lines.parse(self, "lines", self["lines"], content)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/extract/parsers/lines.py", line 20, in parse
settings.update(_settings)
ValueError: dictionary update sequence element #0 has length 3; 2 is required

'''

Ok, that template was using the 'old' syntax.
Let's do it again with the 'new' syntax template.

# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
## new test here
  lines:
    parser: lines
    start: Barcode
    line: (?P<line_note>(---FOOD ITEMS---))
    end: Netto totaal
  lines:
    parser: lines
    start: Barcode
    line: (?P<line_note>(---OTHER ITEMS---))
    end: Netto totaal
  lines:
    parser: lines
    start: Barcode
    line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
    end: Netto totaal
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','

Result:

[
    {
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "currency": "EUR",
        "date": "2021-01-11",
        "desc": "Invoice from Mekro",
        "iban": "NL44INGB0702593702",
        "invoice_number": "0/0(057)0004/041503",
        "issuer": "Mekro",
        "lines": [
            {
                "barcode": "2231012001992",
                "discount": "0,0",
                "name": "KROKETBROODJES",
                "price_unit": "1,00",
                "qty": "2",
                "uom": "KG"
            },
            {
                "barcode": "8713009019455",
                "discount": "0,0",
                "name": "Oil",
                "price_unit": "0,50",
                "qty": "3",
                "uom": "L"
            },
            {
                "barcode": "8713009019475",
                "discount": "0,0",
                "name": "Apple",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "KG"
            },
            {
                "barcode": "8713009019375",
                "discount": "100,0",
                "name": "programmer",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "Hour"
            },
            {
                "barcode": "0013009019475",
                "discount": "0,0",
                "name": "Sticker",
                "price_unit": "0,0",
                "qty": "1",
                "uom": "pce"
            }
        ],
        "partner_coc": "33166113",
        "partner_website": "mekro.nl"
    }
]

It parses 🎉
Much better, but it only parses using the last line setting in the template file.
Note The following is missing:
---FOOD ITEMS---
---OTHER ITEMS---
Thus it only seems to parse the last line definition block from our template.
(Actually it parses all, but overwrites the output.)

Now, lets use the "new syntax" template with the code in this pr:

[
    {
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "currency": "EUR",
        "date": "2021-01-11",
        "desc": "Invoice from Mekro",
        "iban": "NL44INGB0702593702",
        "invoice_number": "0/0(057)0004/041503",
        "issuer": "Mekro",
        "lines": [
            {
                "barcode": "2231012001992",
                "discount": "0,0",
                "name": "KROKETBROODJES",
                "price_unit": "1,00",
                "qty": "2",
                "uom": "KG"
            },
            {
                "barcode": "8713009019455",
                "discount": "0,0",
                "name": "Oil",
                "price_unit": "0,50",
                "qty": "3",
                "uom": "L"
            },
            {
                "barcode": "8713009019475",
                "discount": "0,0",
                "name": "Apple",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "KG"
            },
            {
                "barcode": "8713009019375",
                "discount": "100,0",
                "name": "programmer",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "Hour"
            },
            {
                "barcode": "0013009019475",
                "discount": "0,0",
                "name": "Sticker",
                "price_unit": "0,0",
                "qty": "1",
                "uom": "pce"
            }
        ],
        "partner_coc": "33166113",
        "partner_website": "mekro.nl"
    }
]

🎉 No Difference!

Ok, new lets use the "old" syntax template with the code from this PR:

[
    {
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "currency": "EUR",
        "date": "2021-01-11",
        "desc": "Invoice from Mekro",
        "iban": "NL44INGB0702593702",
        "invoice_number": "0/0(057)0004/041503",
        "issuer": "Mekro",
        "lines": [
            {
                "line_note": "---FOOD ITEMS---"
            },
            {
                "barcode": "2231012001992",
                "discount": "0,0",
                "name": "KROKETBROODJES",
                "price_unit": "1,00",
                "qty": "2",
                "uom": "KG"
            },
            {
                "barcode": "8713009019455",
                "discount": "0,0",
                "name": "Oil",
                "price_unit": "0,50",
                "qty": "3",
                "uom": "L"
            },
            {
                "barcode": "8713009019475",
                "discount": "0,0",
                "name": "Apple",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "KG"
            },
            {
                "line_note": "---OTHER ITEMS---"
            },
            {
                "barcode": "8713009019375",
                "discount": "100,0",
                "name": "programmer",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "Hour"
            },
            {
                "barcode": "0013009019475",
                "discount": "0,0",
                "name": "Sticker",
                "price_unit": "0,0",
                "qty": "1",
                "uom": "pce"
            }
        ],
        "partner_coc": "33166113",
        "partner_website": "mekro.nl"
    }
]

✨🥇 Now we have a full json representation of the lines from the input file.

rmilecki · 2022-08-26T07:41:00Z

I'll try to review this this Sunday (Cc @m3nu)

bosd · 2022-09-19T09:24:02Z

This one depends upon: #394

bosd refactored code to allow multiple lines to be parsed. Restructuring was needed. As we want to output the lines in the order of the invoice file. Independent of the order of the template file. Example yaml input lines: - start: Barcode end: Netto totaal line: (?P<line_note>(FOOD)) - start: Barcode end: Netto totaal line: (?P<description>(.+))\s+(?P<price>\d+\d+) it is also possible to contentate multi line tags. alternative example sammymaystone. lines: - start: Item\s+Quantity\s+Rate\s+Amount end: Subtotal first_line: 'Service (?P<item>\w)\s+(?P<qty>\S+)\s+.(?P<unitprice>\d+.\d{2})\s+.?(?P<linetotal>\d+.\d{2})' line: '(?P<item>.*)' skip_line: ['Pino \w'] # 'Description:', Notes:, last_line: '(?P<desc>Parts:.*)' types: qty: int unitprice: float linetotal: float input: Service A 12 0.00 20.00 Description: Repair Notes: Replaced capacitor Parts: 1 x cap_a Tax: 0.2% output: {'item': 'A\nDescription: Repair\nNotes: Replaced capacitor', 'qty': 12, 'unitprice': 10.0, 'linetotal': 120.0, 'desc': 'Parts: 1 x cap_a'}

bosd · 2022-09-24T13:15:54Z

Local tests are ok. Odd that there is a difference in tests.

bosd · 2022-09-24T21:59:39Z

Another test solved! ✨
Only pdfminer issue left #402

rmilecki · 2022-09-25T10:26:30Z

This pull request rewrites a lot of src/invoice2data/extract/parsers/lines.py code and is really hard for me to review properly. It also modifies parsers code to while focusing on syntax handled by the plugins code. I think those changes don't really belong here.

@bosd I know you spent a lot of time on this but could you please consider my possibly an alternative soluion: #407 , please? My changes should implement the same feature by just adding 15 new lines to the lines.py parser.
On top of my changes you can easily extend lines plugin syntax if you think we really need that, see #408

bosd · 2022-09-25T10:47:13Z

This pull request rewrites a lot of src/invoice2data/extract/parsers/lines.py code and is really hard for me to review properly.

True, it's quite a rewrite. It could even be further improved by dumping all the lines in an array. Possibly that would elliminate some of the for loops. (for loops are known for slowing python down)
But at this point I'm not willing to do that. As I already spent lot of time getting this functionality. (for now my focus is on getting data properly parsed instead of speed)

This one is ready! (Apart from the messy commits, propably easier to make a new branch and pr)

It also modifies parsers code to while focusing on syntax handled by the plugins code. I think those changes don't really belong here.
Can you explain this a bit more? I don't understnd what it modifies on the parser code.

Will have a look at those PR's 😄

rmilecki · 2022-09-26T06:44:09Z

I've clearly missed some requirements, it seems we have a bit different cases after all.

I'll spend some time analyzing this today and tomorrow and come back to you. Sorry for this confusion.

rmilecki · 2022-10-01T21:15:07Z

Ok, going back to the original issue in #377 Parse Multiple lines in the same field.

When passing multiple line arguments in the template I won't get the desired outcome. Made an fictitious Example of an invoice which cannot be parsed. It's inspired by a real wholesaler invoice.

Thank you @bosd for this complete explanation of your case. I carefully read that and I believe I finally understand it fully.

By looking at your desired output it seems that you need to:

Parse 1 specific part of invoice containing multiple lines
Lines use two different formats

That syntax you proposed:

lines:
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---FOOD ITEMS---))
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---OTHER ITEMS---))
 - start: Barcode
   end:  Netto totaal
   line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)

is quite confusing in my opinion. It's because you have there multiple (three) & repeated start and end entries. That suggests you want to parse 3 different blocks (parts) of invoice. As the same time you want them to be merged and put in the right order.

It's really hard for me to guess what should happen if you use start and end RegEx-es that match different parts of invoice. Parsing them would be fine but how to merge them in user-expected way?

It turns out it's already possible to parse your Mekro.pdf without any modifications to the project. I used:

fields:
  lines:
    parser: lines
    start: Barcode
    line: ((?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+))|(?P<line_note>---(.*ITEMS)---)
    end: Netto totaal

(the trick is to use | in the lines RegEx).

That may not be perfect but gives output close to what you expected:

"lines": [
    {
        "barcode": "",
        "name": "",
        "qty": "",
        "uom": "",
        "price_unit": "",
        "discount": "",
        "line_note": "---FOOD ITEMS---"
    },
    {
        "barcode": "2231012001992",
        "name": "KROKETBROODJES",
        "qty": "2",
        "uom": "KG",
        "price_unit": "1,00",
        "discount": "0,0",
        "line_note": ""
    },
    {
        "barcode": "8713009019455",
        "name": "Oil",
        "qty": "3",
        "uom": "L",
        "price_unit": "0,50",
        "discount": "0,0",
        "line_note": ""
    },
    {
        "barcode": "8713009019475",
        "name": "Apple",
        "qty": "1",
        "uom": "KG",
        "price_unit": "50,0",
        "discount": "0,0",
        "line_note": ""
    },
    {
        "barcode": "",
        "name": "",
        "qty": "",
        "uom": "",
        "price_unit": "",
        "discount": "",
        "line_note": "---OTHER ITEMS---"
    },
    {
        "barcode": "8713009019375",
        "name": "programmer",
        "qty": "1",
        "uom": "Hour",
        "price_unit": "50,0",
        "discount": "100,0",
        "line_note": ""
    },
    {
        "barcode": "0013009019475",
        "name": "Sticker",
        "qty": "1",
        "uom": "pce",
        "price_unit": "0,0",
        "discount": "0,0",
        "line_note": ""
    }
]

I admit that using | in RegEx doesn't result in clean (easy to read) RegEx and it also produces unwated (empty) entries in output JSON. For that I think we could improve our lines parser.

I think we should just allow multiple entries in the line. That would result in:

Simple implementation (just few extra lines of code)
Clear syntax (no multiple start and end for parsing one block (part) of invoice)

bosd · 2022-10-03T06:01:03Z

By looking at your desired output it seems that you need to:

Parse 1 specific part of invoice containing multiple lines
Lines use two different formats

I think you understand it correctly. The usage of the | as suggested here is interesting.
Yet, the result is not compatible with the code used to further process the results from this library.

Ooh, yeah, I see the multiple start and end's can be confusing.
(but I think is easy syntax, for converting a table style template builder as input. Yet, thats not implemented in code, so we can ignore that for now)

Kinda hard to put it into full spec.

Parse multiple line formats in order as they appear in the input
Still allow to contentate values if multiple regex rules match the field. (Do contentate untill a new firstline is received)
Output order should be independent from the order of the template. (it should take order from the input file) (nice to have)

(Contentate is not used in the above example)

I'll have a look at the mentioned pr

bosd · 2022-10-21T21:51:01Z

Superseded by #417

Fixup! Flake8

daec273

Fixup2! Flake8

fbf63cc

rmilecki mentioned this pull request Jun 19, 2022

added ability to define multiple line definitions #225

Closed

Fixup! Stop line matching if there is no start or end defined

3036565

Fixup parse multiple first_line's sequentially

0faae0a

bosd force-pushed the multiline branch from 562435b to 0faae0a Compare June 28, 2022 10:03

bosd mentioned this pull request Jun 28, 2022

Json output in wrong order #383

Closed

bosd force-pushed the multiline branch from 0faae0a to cf78619 Compare August 26, 2022 07:24

bosd force-pushed the multiline branch 3 times, most recently from fff0526 to 5048cdf Compare September 24, 2022 11:24

bosd added 5 commits September 24, 2022 14:11

Fixup! Flake8

1a0d473

Fixup2! Flake8

bfb0410

Fixup! Stop line matching if there is no start or end defined

1ba4cab

Fixup parse multiple first_line's sequentially

fde98ed

bosd force-pushed the multiline branch from 5048cdf to fde98ed Compare September 24, 2022 12:11

bosd added 2 commits September 24, 2022 17:11

Merge branch 'invoice-x:master' into multiline

2d93234

Merge branch 'multiline' of github.com:bosd/invoice2data into multiline

41caefc

bosd mentioned this pull request Sep 24, 2022

addition to support multi page invoices #372

Closed

bosd added this to the 0.4.0 release milestone Sep 24, 2022

bosd requested review from m3nu and rmilecki September 24, 2022 18:11

update compare QualityHosting.json

ad6ee6d

fixup when no content is matched between start and end

436ade8

bosd force-pushed the multiline branch from 3516096 to 436ade8 Compare September 25, 2022 09:58

Merge branch 'invoice-x:master' into multiline

3f29118

rmilecki mentioned this pull request Sep 25, 2022

plugins: lines: add support for sets of regexes to legacy syntax #408

Closed

bosd mentioned this pull request Sep 25, 2022

parsers: lines: support "rules" for multiple sets of regexes #407

Closed

rmilecki mentioned this pull request Oct 1, 2022

Lines: support multiple RegEx-es in lines entries #417

Merged

bosd mentioned this pull request Oct 21, 2022

parsers: lines: support multiple occurrence of blocks to parse #423

Merged

bosd closed this Oct 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add posibility to parse multiple line definitions #378

Add posibility to parse multiple line definitions #378

bosd commented May 25, 2022 •

edited by rmilecki

Loading

m3nu commented May 29, 2022

m3nu commented May 29, 2022

bosd commented May 29, 2022

bosd commented Jun 15, 2022

rmilecki commented Jun 19, 2022

bosd commented Jun 19, 2022

bosd commented Jun 20, 2022 •

edited

Loading

m3nu commented Jun 23, 2022

bosd commented Jun 24, 2022

m3nu commented Jun 24, 2022

bosd commented Jun 24, 2022

rmilecki commented Jun 26, 2022 •

edited

Loading

rmilecki commented Jun 26, 2022

bosd commented Jun 27, 2022

rmilecki commented Jun 28, 2022

bosd commented Jun 28, 2022

rmilecki commented Aug 26, 2022

bosd commented Sep 19, 2022

bosd commented Sep 24, 2022

bosd commented Sep 24, 2022

rmilecki commented Sep 25, 2022 •

edited

Loading

bosd commented Sep 25, 2022 •

edited

Loading

rmilecki commented Sep 26, 2022

rmilecki commented Oct 1, 2022

bosd commented Oct 3, 2022

bosd commented Oct 21, 2022

Add posibility to parse multiple line definitions #378

Add posibility to parse multiple line definitions #378

Conversation

bosd commented May 25, 2022 • edited by rmilecki Loading

m3nu commented May 29, 2022

m3nu commented May 29, 2022

bosd commented May 29, 2022

bosd commented Jun 15, 2022

rmilecki commented Jun 19, 2022

bosd commented Jun 19, 2022

bosd commented Jun 20, 2022 • edited Loading

m3nu commented Jun 23, 2022

bosd commented Jun 24, 2022

m3nu commented Jun 24, 2022

bosd commented Jun 24, 2022

rmilecki commented Jun 26, 2022 • edited Loading

rmilecki commented Jun 26, 2022

bosd commented Jun 27, 2022

rmilecki commented Jun 28, 2022

bosd commented Jun 28, 2022

rmilecki commented Aug 26, 2022

bosd commented Sep 19, 2022

bosd commented Sep 24, 2022

bosd commented Sep 24, 2022

rmilecki commented Sep 25, 2022 • edited Loading

bosd commented Sep 25, 2022 • edited Loading

rmilecki commented Sep 26, 2022

rmilecki commented Oct 1, 2022

bosd commented Oct 3, 2022

bosd commented Oct 21, 2022

bosd commented May 25, 2022 •

edited by rmilecki

Loading

bosd commented Jun 20, 2022 •

edited

Loading

rmilecki commented Jun 26, 2022 •

edited

Loading

rmilecki commented Sep 25, 2022 •

edited

Loading

bosd commented Sep 25, 2022 •

edited

Loading